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Abstract 

Gauge fixing is a frequent task encountered in practical lattice 
gauge theory calculations. We review the performance character- 
istics of some standard gauging procedures for non-Abelian gauge 
theories, implemented on the parallel machines CM2 and CMS. 
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1 Introduction 



Large-scale computer simulations have become an important tool in the theory of 
elementary particles. There is a very rich empirical material to test quantum chro- 
modynamics (QCD), which is held to be the basic theory of strong inter actions [|1|. 
Most of it is related to low energy data such as the spectrum and the structure of 
hadrons, where nonperturbative methods such as lattice gauge theory (LGT) in four 
dimnsions are indispensable]^]. 

In many practical situations of LGT it is necessary to perform gauge fixing on the 
discrete lattice. This is in particular the case if one is interested in the computa- 
tion of gauge-noninvariant quantities such as gluon propagators^^, Bcthe-Salpeter 
amplitudes^ or monopole densities|0]. Gauge fixing might also be helpful to reduce 
noise in the measurement of gauge invariant quantities (such as Polyakov lines or 
Wilson loops) or in the context of preconditioning |l( 



Various methods have been proposed to achieve gauge fixing [1C][13]. They are 
based on iterative schemes, which might need many iteration steps and thus tend 
to be rather time consuming. It is therefore of considerable interest to study their 
convergence behaviour and the efficiency of their implementations on parallel com- 
puters. 

In this paper we will present an introduction into the issues and compare two such 
algorithms for Landau gauge fixing in lattice QCD, on our local connection machines, 
CM2 and CMS. We will also consider some variants of these basic algorithms, that 
aim at possible acceleration gains. 



2 Gauge fixing on the lattice 

QCD gauge fields are defined on the lattice in terms of parallel transport operators, 
U^{x) G SU{3), fi = 1..4, that live on the links between neighbouring sites x, x + jl 
of a fourdimensional hypercubic lattice and are related to the continuum fields Af^{x) 
by ^ 

= — {U^{x) - f/^(x)) Itraceless ■ (1) 

U^{x) are matrices in a threedimensional 'color' space, and the set {f/^} is called a 
configuration. Under a local gauge transformation U^{x) transforms like 

U^ix) U^ix) ^""^ = Gix)U^ix) Gix + fx), (2) 

where the matrices G are elements of the gauge group SU (3) and are associated to 
the lattice sites. 
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In the continuum formulation the local Landau gauge fixing condition, df^A^^ = 0, 
can be viewed as the solution to the global variational problem of minimizing the 
integral / d^xA^A^. The gauge fixing condition in its global form can easily be 
transformed into a lattice relation, in terms of U^{x) : 



F{G):=ReTr ^t^i^) = Extremum. (3) 

fj, X 



Obviously, F{G) has an upper and a lower bound. Accordingly, there exist both 
an absolute minimum and maximum. Notice that both maximizing and minimizing 
F{G) will satisfy condition (3). Hence we have at least two Landau configurations 
which differ by gauge transformation. This is due to the fact that, for SU{?>) gauge 
theory, the Landau condition is not sufficient to fix the gauge completely, which 
leads to the notorious phenomenon of Gribov copies 

We consider next a given configuration to be cast into Landau gauge. This will 
be achieved by driving F to an extremum, in an iterative process. For constructing 



G{x) , we will follow the algorithms introduced in refs. [|TO||[T^ , which will be referred 
to as Cornell JIU I and Los Alamos [P^ methods. 



a. The Cornell Method 

In a convergent scheme, the local gauge transformation G{x) is expected to approach 
unity with increasing iteration number i. The Cornell method therefore starts off 
from the ansatz 

= e~'°^M^-^''(^), (4) 

which is expanded to first order: 

G{X)^l-iaY,d^A^'{x). (5) 



This approximation requires a subsequent projection of G{x) back into the group 
space. A^{x) is given by eq. (1) and the lattice form of d^A^{x) is 

d^A^'ix) = A^{x) - A^{x - fi). (6) 

The quantity a is a real parameter to be suitably chosen, depending on the under- 
lying lattice. 

b. The Los Alamos Method 

One introduces 

w{x) = J2U,{x) +U;{x-fi) (7) 
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and rewrites eq.(3) as 

F{G) = lReTrY^T.Ul^\x) + Ul{'^^{x-il) (8) 

=: ^ReTr ^ (9) 

xdred or black 

The basic idea is now to construe G{x) such that F changes its size monotonically 
from one iteration step to the next. This construction starts from a checkerboard 
(red-black) subdivision of the lattice, with G{x) being equal to unity on the red 
(black) sites at even (odd) iteration steps. The local gauge transformation now 
simplifies to 

U^{x) G{x).U^{x).l (10) 

U^{x-fi) l.U^{x- fi).G+{x) (11) 

and can be carried out in parallel on the entire lattice. This iteration step can be 
recast into the simple form 

w{x) — y w^{x) = G{x) w{x). (12) 

The non-unity part of G{x) is chosen to be a projection of w{x) onto the group 
manifold that obeys 

Re tr G{x)w{x) > Re tr w{x). (13) 

Note that the change in F is due to independent local changes in w{x). For this 
reason one has to interchange the role of red and black points after each step. 

In the following we will study the convergence behaviour of these basic methods 
in conjunction with suitable acceleration techniques, implemented on the massively 
parallel machines CM2 and CMS. We present results obtained on 8 gauge configu- 
rations of a 8*^ hypercubic lattice, equilibrized at /3 = 5.7, as well as 2 configurations 
of a 16^ lattice , at /3 ~ 6.0. These lattices were generated using a combination of 
heat bath and overrelaxation sweeps. The configurations are separated from each 
other by 1000 sweeps. 

3 Techniques 
3.1 Reunitarization 

We have seen that proper unitarization is an important feature of gauge fixing 
iterative schemes. The Gram-Schmidt orthonormalization scheme (GS) works very 
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well for the Cornell method but not at all for the Los Alamos algorithm, as it does 
not ensure the validity eq. (13). This can be achieved by the projection method of 



maximal trace (MaxTr), which is based on the Cabbibo-Marinari trick p 



The projection G{x) of a 3x3 matrix w{x) onto the SU{?>) group is computed iter- 
atively with the recursive step 

and the initial condition Gq{x) = w{x), where 
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the elements in each matrix A], equal divided by the corresponding scaling 



factor Nl. 



By construction, Gi G SU{3),i > 1 and Tr(Gj+iGo) > Tr{Go). We achieve maximal 
trace (within our 64 bit machine accuracy) after about iV ~ 5 — 7 steps, and use 
Gn for the projection of w. 



3.2 Convergence criteria 

We have to define an appropriate quantity that can serve as a monitor for the quality 
of gauge fixing achieved during the iteration process. Our aim is to minimize the 
quantity 6 =\ extremum{F) — Fi \. As extremum{F) is not known during the 
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iterative procedure, we have to use other quantities than 6 to judge the convergence. 
In the hterature two possibhties have been considered^: 



1. One can use [|T0| 



0"! 



(14) 



as a direct measure of fulfillment of the Landau gauge condition. L is the 
lattice size, and, for the SU{?>) gauge theory, ric = 3. 

2. As Gj — > unit operator with i ^ oo the average trace of the gauge matrices 

1 



(79 = 1 



Re Tr J2 



X] 



(15) 



can serve as an alternative monitor for convergence WA 



As F is quadratic in all variables, we propose here as a third possibility to employ 
the rate of change in the iteration step Fi — > Fj+i as a criterium for convergence 
achieved at step i 

as = Fi+i - F,. (16) 
Notice that F behaves monotonically. 
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Figure 1: 

left: The convergence behaviour in maximizing F, using the Cornell method and 
reunitarization procedure of maximal trace, right: The same for Los Alamos method. 



^In [|6|, extremum{F) is appearently estimated by the last F/v where N is the last iterative 
step. This is however a very misleading quantity 
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400 800 400 800 

iterations iterations 

Figure 2: Same as in Fig. 1, hut with signals appropriately rescaled. 

We see from Fig. 1 that the three signals differ considerably in size. One must thus 
be careful when comparing the quality of gauge fixing quoted by different authors. 
Nevertheless, to the degree that the signals aj provide useful measures for the dis- 
tance 6, we would expect them to show similar shapes during the iterative process, 
i.e. coincident positions of maxima and minima. This is indeed the the 
signals can nearly be made to coincide by appropriate rescaling (see Fig. 2). 



So far, we have considered gauge transformations for maximizing F{G). In order to 
drive the system into a minimum of F, we simply revert the previous construction, 
using = instead of G. We see in Fig. 3 that the convergence behaviour of 
the maximizing and minimizing procedures is very similar during the first several 
hundred iterations. After that, the minimizing iterative scheme starts to oscillate 
badly, with poor convergence compared to the maximizing case. Recall, however, 
that the minimization procedure yet rendus a monotonic behaviour in F. 



4 Accelerating Procedures 



The discussed algorithms, which are basically local, perform sufficiently on small 
lattices, such as 8^. On larger lattices, however, they show poor efficiency, due to the 
phenomenon of critical slowing down which occurs when the convergence modulating 
matrix carries a large range of eigenvalues. Various methods have been proposed to 
speed up relaxation, like preconditioning in the Fourier space|]lO|, overrelaxation |ll2[ 



or multigrid schemes |]I^]||16||. We want to comment here shortly on some of these 
methods as implemented in data-parallel computing. 
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Figure 3: cxi obtained by maximizing and minimzing F using different reunitarisation 
methods. After several hundred iterations, it beginns to oscillate widely. 



Fourier acceleration |T0| We consider the convergence behaviour of the Cornell 
method which it is controlled by the matrix ^dfj_A^. In momentum space, the 
modes of this matrix converge with relaxation times proportional to jp. Thus the 
overall relaxation r time will be determined by the smallest momentum states: 
T ~ 1/Pmm — ■ To overcome this critical slowing down, one modifies the gauge 
transformation matrix G{x) in Fourier space such that all modes converge to zero 
at the same rate (as fast as the fastest mode): 

G{x) = e-^ — ^^^^ 
where F denotes the Fourier transform. 



In Fig. 4 we see that this preconditioning reduces the required number of iterations 
by a factor of about tw(^ 

Note however, that the cost of the Fourier transformation is non-negligeable on 
parallel machines, due to its wide range communication requirements. We used the 
fast Fourier transform (FFT) subroutine, as provided by the scientific subroutine 

^Nevertheless this result is not as expected from ref which quotes a gain factor of 7 ! 
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library (CMSSL) on the CM2. This subroutine requires a special ordering of parallel 
data, the so called SEND ordering. Our axis are, however, NEWS-ordered. In this 
case FFT carries out an internal reordering from NEWS to SEND data structure, 
prior to the actual computation of FFT[|l^. As a result, the iteration step on the 
CM2 is slowed down by a factor 15, through the two FFT steps, Eq. 17. 
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Figure 4: The effect of Fourier acceleration on the convergence behaviour for the 8^ 
lattices. 



Overrelaxation Overrelaxation methods are widely used in improving convergence 
of iterative methods, as well as in overcoming critical slowing down in Monte Carlo 
updating. In ref.|jl2|] Mandula and Ogilvie applied overrelaxation ideas to lattice 
gauge fixing. They replaced the gauge transformation matrix G{x) by an infinite 
series 



n=0 



where 



Tiu + 1) 



n] 



(19) 



The overrelaxation parameter uj can take values between 1 and 2. The optimal 
choice is to be made empirically^. For our 8^ lattices we found that the best value 
for UJ lies near 1.45 (cf. Fig. 5). We truncated the series after two terms (this 
corresponds to the original form introduced by Adler fl^)- Calculations with higher 
order terms showed similar behaviour. It is worth mentioning that the overhead of 
the overrelaxation is very small. 

Stochastic overrelaxation [|l^] In this method one applies a local gauge transfor- 
mation G{xY with probability p, instead of always applying G{x) . For p = 1 the 

where L is the lattice size 1131 [0 



Approximately, the optimal choice oJopt 
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Figure 5: ai for different overrelaxation parameters. 



procedure definitely diverges. This "go wrong once in a while" principle has the 
capability, however, to render a considerable speed up of convergence of the two 
gauging methods treated in this work. The actual acceleration gain turns out to 
depend strongly on p. For our 8^ lattices we found best convergence to occur at 
p ~ 0.9. 
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Figure 6: Efficacy of stochastic overrelaxation as a function of the probability p. 



5 Performance on CM2 and CM5 

The Connection Machines in Wuppertal The CM2 at the University of Wup- 
pertal is an 8k machine with 256 64-bit floating point accelerators and 256 MByte 
memory. Data 10 is performed on a disk parallel storage system (Data Vault), with 
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10 Gbyte capacity. The CMS is configured as a 32 processor nodes machaine with 4 
vector units to the node, and a 16 Gbyte scalabale disk array (SDA). The CM5 can 
be used in both message passing and data parallel models. Given the 4 dimensional 
hypcrcubic structure of our lattices, we implement the gauging procedures in the 
data parallel model. 

Implementation In the following we will describe some features of our CMFOR- 
TRAN implementation on the above CM2 and CMS machines. 
First, we have to establish a proper distribution of our data (gauge fields {U}) 
among the processors. This has to be chosen to minimize data traffic. The data 
layout has of course a direct influence on the subgrid structure, that the CMF com- 
piler produces. This is especially important on the CMS because its performance 
is heavily constrained by communication. Physicswise, we are dealing with nearest 
neighbour interactions. Technically, on the CMS, our nearest neighbour interactions 
are mapped - by the cshift operation within CMF - onto data movements which 
occur in-processor, on-chip and between-chips, in decreasing order in speed. It must 
therefore be the main goal to attain a layout that minimizes the amount of off-chip 
cshift operations. That implies a subgrid geometry, for which most cshifts are to 
be performed on the longest axis. This is achieved by assigning suitable weights to 
each axis. 

The Cornell method, for example, is obviously isotropic in the sense, that all space 
time directions are equivalent with respect to the amount of shift operations re- 
quired. It is therefore natural to select the "canonical" layout for the gauge field 
{U}, with all space time axis declared to be parallel with equal weights. The two 
matrix (color) indices of {U} and the Lorentz index /j, are chosen to be serial, which 
leaves us finally with the array structure 

jjicornell) ^ jj (^^^^ ^ 2., L) 

cmf%layout U{: serial, : serial, : serial, : news, : news, : news, : news). 

nc = 3 is the dimension of color space, ni = 4 corresponds to the range of the 
Lorentz index and L is the linear size of our hypercubic lattice. 

The Los Alamos method, on the other hand, induces an asymmetry into the code, 
due to the red black splitting. We map the entire lattice onto the red (black) part 
using a restricted lattice with a geometry of the form {L/2, L, L, L). In order to store 
all the links on such a lattice we double the range of one particular serial index, which 
wc choose to be the Lorentz index, ji. Due to the resulting asymmetric geometry, 
one expects to enhance the performance in assigning appropriate weights to the four 
parallel axis. We found that the best performance is achieved by assigning a relative 
weight of two to the short axis. As a result, we work with the array 

U{LosAlamos) ^ 2ni,^, L, L, L) 

cmf$layout U{: serial, : serial, : serial, 2 : news, : news, : news, : news). 
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Performance Data For clarity, we restrict ourselves in the following to quoting 
measurements from codes produced by the CMF compiler (release cmf 2.1 beta 0.1 
on the CMS), using complex arithmetic in double precision. On the CM2, we are 
running the slicewise CMF compiler (cmf 1.2). It goes without saying, that there 
is room for improvement of the pure FORTRAN code by resorting to lower level 
language programming in some kernel routine^. 

We find that the local SU{3) multiplication has a performance of about 1.2 Gfiops 
(370 Mfiops) on our CM5 (CM2) for a 16^ lattice. On the CMS, this corresponds 
to about 30% of the peak rate of 4 Gfiops. The reason for this rather low perfor- 
mance lies in the fact that the present CMF compiler does not yet produce optimal 
complex arithmetic for the vector units: it translates a series of complex number 
multiplications into a code with too many add-load and mul-load commands, rather 
than mul-adds. 

An important issue is the additional degradation of these performance characteristics 
through communication. Two features have a large impact on this latter loss: 1. The 
size and shape of the subgrids residing on the individual processors - the relevant 
surface effects in communication can be influenced by the programmer's layout; 2. 
the latency of the cshift operations during run time|. As a result we flnd, on the 
CMS, the performance P(8'^) for the 8^ lattice to be only 6S% of P(16^). For this 
reason the data presented here refer to 16^ lattices only. 

In table 1 we compare performance flgures from CM2 and CMS, as measured on 
the Cornell algorithm. Notice that interprocessor-communication is needed when 
calculating the quantity J2dfj.Afj^ according to Eq. 6 conflguration {t/^^}, and when 
performing the gauge transformation 

Uf,{x) — > G{x)U^{x)G{x + fi). 

In table 2 we present the corresponding performance data from the Los Alamos 
algorithm. Within this method, the gauging step proper is carried out locally. After 
each such step, however, one must rearrange the links from red (or black) into 
black (or red) ordering. This gather/scatter stage involves the communication and 
deteriorates the performance from the pure gauging step, which runs at 1 Gflops 
on the CMS. The communication overhead being nearly 60% of run-time on both 
machines, the floprate is flnally degraded to an average of 304 Mflops on the CMS, 
which is merely 8% of the theoretical peak performance. Note, that the computing 
of (Ti is more expensive in this case, due to the data structure. 

*For SU{3) matrix multiplication, e.g., a performance gain of up to 30% may be reached by 
programming in DPEAC (CDPEAC) on the CM2 (CM5), respectively. 

^The present implementation of CMF on the CMS suffers particularly from the large latency 
time of 300 msec. 
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The situation appears to be more favorable for the Cornell method, where the time 
required for communication is 37% on the CMS, and 27 % on the CM2, which leads 
to an overall performance of 530 MFlops for the CM5 (198 MFlops for the CM2). 



CM2 CM5 



Subroutine 


MFlops time in % 


MFlops time in % 


SU(3)-projection 
transformation 


166.5 18.45 
556.8 1.1 

301.6 45 
68 34.75 


409.7 20.1 
1481.8 1.1 

918.5 39.5 
163.1 38.8 



Table 1: Benchmarking the Cornell Method 



CM2 CM5 



Subroutine 


MFlops time in % 


MFlops time in % 


Computation of G 
transformation 


343.85 16.6 
84.5 17.6 

256.3 0.4 
367 8.86 


1026 16.5 
210 20 
579.8 0.55 
1202 7.8 



Table 2: Benchmarking the Los Alamos Method 



Wc should mention, that the results presented in this paper arc based upon a test 
version of the software where the emphasis was on providing functionality and the 
tools necessary to begin testing the CM5 with vector units. The CM5 software 
release has not had the benefit of optimization or performance tuning and, conse- 
quently, is not necessarily representative of the performance of the full version of 
this software. 
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